Abstract

Due to the success of the bag-of-word modeling paradigm, clustering histograms has become an important ingredient of modern information processing. Clustering histograms can be performed using the celebrated $k$-means centroid-based algorithm. From the viewpoint of applications, it is usually required to deal with symmetric distances. In this letter, we consider the Jeffreys divergence that symmetrizes the Kullback-Leibler divergence, and investigate the computation of Jeffreys centroids. We first prove that the Jeffreys centroid can be expressed analytically using the Lambert $W$ function for positive histograms. We then show how to obtain a fast guaranteed approximation when dealing with frequency histograms. Finally, we conclude with some remarks on $k$-means histogram clustering.

1 Introduction: Motivation and prior work

1.1 Motivation: The Bag-of-Word modeling paradigm

Classifying documents into categories is a common task of information retrieval systems: Given a training set of documents labeled with categories, one asks to classify incoming new documents. Text categorization [1] proceeds by first defining a dictionary of words (the corpus). It then models each document by a word count, yielding one word histogram per document. Defining a proper distance $d(\cdot, \cdot)$ between histograms allows one to:

• Classify a new on-line document: we first calculate its histogram signature and then
seek the labeled document whose histogram is the most similar, in order to deduce its tag
(e.g., using a nearest neighbor classifier).

• Find the initial set of categories: we cluster all document histograms and assign a 
category per cluster. 

It has been shown experimentally that the Jeffreys divergence (symmetrizing the Kullback-Leibler divergence) achieves better performance than the traditional tf-idf method. This text classification method based on the Bag of Words (BoWs) representation has also been instrumental in computer vision for efficient object categorization [2] and recognition in natural images. It first requires creating a dictionary of "visual words" by quantizing keypoints of the training database. Quantization is then performed using the $k$-means algorithm [8], which partitions $n$ data points $X = \{x_1, ..., x_n\}$ into $k$ clusters $C_1, ..., C_k$, where each data element belongs to the closest cluster center. From a given initialization, Lloyd's batched $k$-means first assigns points to their closest cluster centers, then updates the cluster centers, and reiterates this process until convergence is met after a finite number of steps. When the distance function $d(x, y)$ is chosen as the squared Euclidean distance $d(x, y) = \|x - y\|^2$, the cluster centroids update to their centers of mass. Csurka et al. [2] used the squared Euclidean distance for building the visual vocabulary. Depending on the considered features, other distances have proven useful: For example, the Jeffreys divergence was shown to perform experimentally better than the Euclidean or squared Euclidean distances for Compressed Histogram of Gradient descriptors [3]. To summarize, $k$-means histogram clustering with respect to the Jeffreys divergence can be used both to quantize visual words to create a dictionary and to cluster document words for assigning initial categories.
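As a rough illustration of the assign/update loop described above, here is a minimal Python sketch of Lloyd's batched $k$-means with the squared Euclidean distance. The function names, the toy data, and the choice to keep the previous center when a cluster becomes empty are illustrative assumptions, not details taken from the cited works.

```python
def squared_euclidean(x, y):
    """Squared Euclidean distance ||x - y||^2 between two vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def lloyd_kmeans(points, initial_centers, max_iter=100):
    """Batched k-means from a given initialization; returns (centers, labels)."""
    centers = [list(c) for c in initial_centers]   # do not mutate the caller's list
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest cluster center.
        new_labels = [min(range(len(centers)),
                          key=lambda c: squared_euclidean(p, centers[c]))
                      for p in points]
        if new_labels == labels:                   # assignments stable: converged
            break
        labels = new_labels
        # Update step: each center moves to the center of mass of its cluster.
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                            # assumption: keep old center if empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

# Toy data (made up): two well-separated groups in the plane.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, labels = lloyd_kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
print(centers, labels)
```

Replacing `squared_euclidean` by another divergence changes the assignment step directly, but the update step then needs the centroid of that divergence, which is the question studied in this letter for the Jeffreys divergence.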

Let $w_h = \sum_{i=1}^d h^i$ denote the cumulative sum of the bin values of histogram $h$. We distinguish between positive histograms and frequency histograms. A frequency histogram $\tilde{h}$ is a unit histogram (i.e., the cumulative sum of its bins adds up to one). In statistics, those positive and frequency histograms correspond respectively to positive discrete and multinomial distributions when all bins are non-empty. Let $\mathcal{H} = \{h_1, ..., h_n\}$ be a collection of $n$ histograms with $d$ positive-valued bins. By notational convention, we use the superscript and the subscript to indicate the bin number and the histogram number, respectively. Without loss of generality, we assume that all bins are non-empty¹: $h_j^i > 0$, $1 \leq j \leq n$, $1 \leq i \leq d$.

To measure the distance between two such histograms $p$ and $q$, we rely on the relative entropy. The extended KL divergence [8] between two positive (but not necessarily normalized) histograms $p$ and $q$ is defined by $\mathrm{KL}(p : q) = \sum_{i=1}^d p^i \log\frac{p^i}{q^i} + q^i - p^i$. Observe that this information-theoretic dissimilarity measure is not symmetric, nor does it satisfy the triangle inequality of metrics. Let $\tilde{p} = \frac{p}{\sum_{i=1}^d p^i}$ and $\tilde{q} = \frac{q}{\sum_{i=1}^d q^i}$ denote the corresponding normalized frequency histograms. In the remainder, the tilde denotes this normalization operator. The extended KL divergence formula applied to normalized histograms yields the traditional KL divergence [8]: $\mathrm{KL}(\tilde{p} : \tilde{q}) = \sum_{i=1}^d \tilde{p}^i \log\frac{\tilde{p}^i}{\tilde{q}^i}$, since $\sum_{i=1}^d \tilde{q}^i - \tilde{p}^i = \sum_{i=1}^d \tilde{q}^i - \sum_{i=1}^d \tilde{p}^i = 1 - 1 = 0$. The KL divergence is interpreted as the relative entropy between $\tilde{p}$ and $\tilde{q}$: $\mathrm{KL}(\tilde{p} : \tilde{q}) = H^\times(\tilde{p} : \tilde{q}) - H(\tilde{p})$, where $H^\times(\tilde{p} : \tilde{q}) = \sum_{i=1}^d \tilde{p}^i \log\frac{1}{\tilde{q}^i}$ denotes the cross-entropy and $H(\tilde{p}) = H^\times(\tilde{p} : \tilde{p}) = \sum_{i=1}^d \tilde{p}^i \log\frac{1}{\tilde{p}^i}$ is the Shannon entropy. This distance is explained as the expected extra number of bits per datum that must be transmitted when using the "wrong" distribution $\tilde{q}$ instead of the true distribution $\tilde{p}$. Often $\tilde{p}$ is hidden by nature and needs to be hypothesized while $\tilde{q}$ is estimated. When clustering histograms, all histograms play the same role, and it is therefore better to consider the Jeffreys [1] divergence $J(p, q) = \mathrm{KL}(p : q) + \mathrm{KL}(q : p)$ that symmetrizes the KL divergence:

$$J(p, q) = \sum_{i=1}^d (p^i - q^i) \log\frac{p^i}{q^i} = J(q, p). \quad (1)$$

Observe that the formula for Jeffreys divergence holds for arbitrary positive histograms (including frequency histograms).

¹ Otherwise, we may add an arbitrarily small quantity $\epsilon > 0$ to all bins. When frequency histograms are required, we then re-normalize.
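The following short Python sketch evaluates the extended KL divergence and the Jeffreys divergence of Eq. 1 on two small positive histograms. The function names and the numerical values are made up for illustration; the checks only restate what the definitions above imply (the $q^i - p^i$ correction terms cancel in the sum of the two KL terms, and $J$ is symmetric while KL is not).

```python
import math

def extended_kl(p, q):
    """Extended KL divergence between positive (possibly unnormalized) histograms."""
    return sum(pi * math.log(pi / qi) + qi - pi for pi, qi in zip(p, q))

def jeffreys(p, q):
    """Jeffreys divergence of Eq. 1: sum_i (p^i - q^i) * log(p^i / q^i)."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Two positive, unnormalized histograms with d = 3 bins (illustrative values).
p = [2.0, 1.0, 3.0]
q = [1.0, 4.0, 0.5]

# J is the sum of the two oriented extended KL divergences.
assert math.isclose(jeffreys(p, q), extended_kl(p, q) + extended_kl(q, p))
# J is symmetric, whereas KL is not.
assert math.isclose(jeffreys(p, q), jeffreys(q, p))
assert not math.isclose(extended_kl(p, q), extended_kl(q, p))
```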

This letter is devoted to efficiently computing the Jeffreys centroid $c$ of a set $\mathcal{H} = \{h_1, ..., h_n\}$ of weighted histograms, defined as:

$$c = \arg\min_x \sum_{j=1}^n \pi_j J(h_j, x), \quad (2)$$

where the $\pi_j$'s denote the positive histogram weights (with $\sum_{j=1}^n \pi_j = 1$). When all histograms $h_j \in \mathcal{H}$ are normalized, we require the minimization over $x$ to be carried out over $\Delta_d$, the $(d - 1)$-dimensional probability simplex. This yields the Jeffreys frequency centroid $\tilde{c}$. Otherwise, for positive histograms $h_j \in \mathcal{H}$, the minimization over $x$ is done over the positive orthant $\mathbb{R}_+^d$, to get the Jeffreys positive centroid $c$. Since the $J$-divergence is convex in both arguments, both the Jeffreys positive and frequency centroids are unique.



1.2 Prior work and contributions 

On one hand, clustering histograms has been studied using various distances and clustering algorithms. Besides the classical Minkowski $\ell_p$-norm distances, hierarchical clustering with respect to the $\chi^2$ distance has been investigated in [7]. Banerjee et al. [8] generalized $k$-means to Bregman $k$-means, thus allowing one to cluster distributions of the same exponential family with respect to the KL divergence. Mignotte [9] used $k$-means with respect to the Bhattacharyya distance [10] on histograms of various color spaces to perform image segmentation. On the other hand, Jeffreys $k$-means has not yet been extensively studied, as the centroid computations are non-trivial: In 2002, Veldhuis [11] reported an iterative Newton-like algorithm, requiring two nested loops, that approximates arbitrarily finely the Jeffreys frequency centroid $\tilde{c}$ of a set of frequency histograms. Nielsen and Nock [12] considered the information-geometric structure of the manifold of multinomials (frequency histograms) to report a simple geodesic bisection search algorithm (i.e., replacing the two nested loops of [11] by one single loop). Indeed, the family of frequency histograms belongs to the exponential families [8], and computing the Jeffreys frequency centroid amounts to computing a symmetrized Bregman centroid [12].

To overcome the explicit computation of the Jeffreys centroid, Nock et al. [13] generalized the Bregman $k$-means [8] and $k$-means++ seeding using mixed Bregman divergences: They consider two dual centroids $c_m$ and $c_m^*$ attached per cluster, and use the following divergence depending on these two centers: $\Delta\mathrm{KL}(c_m : x : c_m^*) = \mathrm{KL}(c_m : x) + \mathrm{KL}(x : c_m^*)$. However, note that this mixed Bregman 2-centroid-per-cluster clustering is different from the Jeffreys $k$-means clustering, which relies on one centroid per cluster.






This letter is organized as follows: Section 2 reports a closed-form expression of the positive Jeffreys centroid for a set of positive histograms. Section 3 studies the guaranteed tight approximation factor obtained when normalizing the positive Jeffreys centroid, and further describes a simple bisection algorithm to approximate the optimal Jeffreys frequency centroid arbitrarily finely. Section 4 reports on our experimental results, which show that our normalized approximation is in practice tight enough to avoid the bisection process. Finally, Section 5 concludes this work.

2 Jeffreys positive centroid 

We consider a set $\mathcal{H} = \{h_1, ..., h_n\}$ of $n$ positive weighted histograms with $d$ non-empty bins ($h_j \in \mathbb{R}_+^d$, $\pi_j > 0$ and $\sum_j \pi_j = 1$). The Jeffreys positive centroid $c$ is defined by:

$$c = \arg\min_{x \in \mathbb{R}_+^d} J(\mathcal{H}, x) = \arg\min_{x \in \mathbb{R}_+^d} \sum_{j=1}^n \pi_j J(h_j, x). \quad (3)$$

We state the first result:

Theorem 1 The Jeffreys positive centroid $c = (c^1, ..., c^d)$ of a set $\{h_1, ..., h_n\}$ of $n$ weighted positive histograms with $d$ bins can be calculated component-wise exactly using the Lambert $W$ analytic function: $c^i = \frac{a^i}{W\left(\frac{a^i}{g^i} e\right)}$, where $a^i = \sum_{j=1}^n \pi_j h_j^i$ denotes the coordinate-wise arithmetic weighted means and $g^i = \prod_{j=1}^n (h_j^i)^{\pi_j}$ the coordinate-wise geometric weighted means.

Proof We seek $x \in \mathbb{R}_+^d$ that minimizes Eq. 3. After expanding the Jeffreys divergence formula of Eq. 1 in Eq. 3 and removing all additive terms independent of $x$, we find the following equivalent minimization problem:

$$\min_{x \in \mathbb{R}_+^d} \sum_{i=1}^d x^i \log\frac{x^i}{g^i} - a^i \log x^i.$$

This optimization can be performed coordinate-wise, independently. For each coordinate, dropping the superscript notation and setting the derivative to zero, we have to solve $\log\frac{x}{g} + 1 - \frac{a}{x} = 0$. Substituting $u = \frac{a}{x}$ turns this equation into $u e^u = \frac{a}{g} e$, which yields $x = \frac{a}{W\left(\frac{a}{g} e\right)}$, where $W(\cdot)$ denotes the Lambert $W$ function [14].
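The closed form of Theorem 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `lambert_w0` is a small hand-written helper for the principal branch (its starting point and stopping tolerance are arbitrary choices), and the histograms and weights are made-up values. The final loop checks the stationarity condition $\log\frac{x}{g} + 1 - \frac{a}{x} = 0$ from the proof on each coordinate.

```python
import math

def lambert_w0(z):
    """Principal branch W_0(z) for z >= 0, by Halley iterations on w * e^w = z."""
    w = math.log1p(z)                      # assumed starting point, above the root
    for _ in range(50):
        ew = math.exp(w)
        f = w * ew - z
        step = f / (ew * (w + 1.0) - (w + 2.0) * f / (2.0 * w + 2.0))
        w -= step
        if abs(step) <= 1e-15 * (1.0 + abs(w)):
            break
    return w

def weighted_means(histograms, weights):
    """Coordinate-wise weighted arithmetic means a^i and geometric means g^i."""
    d = len(histograms[0])
    a = [sum(w * h[i] for w, h in zip(weights, histograms)) for i in range(d)]
    g = [math.exp(sum(w * math.log(h[i]) for w, h in zip(weights, histograms)))
         for i in range(d)]
    return a, g

def jeffreys_positive_centroid(histograms, weights):
    """Closed form c^i = a^i / W(a^i * e / g^i) of Theorem 1."""
    a, g = weighted_means(histograms, weights)
    return [ai / lambert_w0(ai * math.e / gi) for ai, gi in zip(a, g)]

# Illustrative positive histograms (d = 2) with weights summing to one.
hs = [[1.0, 4.0], [3.0, 0.5]]
ws = [0.25, 0.75]
c = jeffreys_positive_centroid(hs, ws)
a, g = weighted_means(hs, ws)
for ci, ai, gi in zip(c, a, g):
    # Derivative of x*log(x/g) - a*log(x) vanishes at the centroid coordinate.
    assert abs(math.log(ci / gi) + 1.0 - ai / ci) < 1e-9
```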

The Lambert function² $W$ is defined by $W(x) e^{W(x)} = x$ for $x \geq 0$. That is, the Lambert function is the functional inverse of $f(x) = x e^x = y$: $x = W(y)$. Although function $W$ may seem non-trivial at first sight, it is a popular elementary analytic function similar to the logarithm or exponential functions. In practice, we get a fourth-order convergence algorithm to estimate it by implementing Halley's numerical root-finding method. It requires fewer than 5 iterations to reach machine accuracy using the IEEE 754 floating point standard [14]. Notice that the Lambert $W$ function plays a particular role in information theory [15].

² We consider only the branch $W_0$ [14] since arguments of the function are always positive.
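For readers who want to experiment, a Halley iteration for $W_0$ on non-negative arguments might look as follows. This is a sketch rather than the implementation of [14]: the starting point $\log(1 + z)$, the relative stopping tolerance, and the iteration cap are assumptions chosen for simplicity, and no claim is made here about the number of iterations needed.

```python
import math

def lambert_w0(z, tol=1e-15, max_iter=50):
    """Principal branch W_0(z) for z >= 0, solving f(w) = w * e^w - z = 0.

    Halley update: w <- w - f / (f' - f * f'' / (2 * f')),
    with f' = e^w * (w + 1) and f'' = e^w * (w + 2).
    """
    if z < 0.0:
        raise ValueError("this sketch only handles z >= 0")
    w = math.log1p(z)                      # assumed starting point
    for _ in range(max_iter):
        ew = math.exp(w)
        f = w * ew - z
        step = f / (ew * (w + 1.0) - (w + 2.0) * f / (2.0 * w + 2.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w

# Defining relation W(z) * exp(W(z)) = z, checked on a few arguments.
for z in (1e-6, 0.5, 1.0, math.e, 10.0, 1e6):
    w = lambert_w0(z)
    assert math.isclose(w * math.exp(w), z, rel_tol=1e-12)
```

Since $W(e) = 1$, calling `lambert_w0(math.e)` is a convenient sanity check.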
3 Jeffreys frequency centroid

We consider a set $\tilde{\mathcal{H}}$ of $n$ frequency histograms: $\tilde{\mathcal{H}} = \{\tilde{h}_1, ..., \tilde{h}_n\}$.

3.1 A guaranteed approximation

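If we relax $x$ to the positive orthant $\mathbb{R}_+^d$ instead of the probability simplex, we get the optimal positive Jeffreys centroid $c$, with $c^i = \frac{a^i}{W\left(\frac{a^i}{g^i} e\right)}$ (Theorem 1). Normalizing this positive Jeffreys centroid to get $\tilde{c}' = \frac{c}{w_c}$ does not yield the Jeffreys frequency centroid $\tilde{c}$, which requires dedicated iterative optimization algorithms [11], [12]. In this section, we consider approximations of the Jeffreys frequency centroid. Veldhuis [11] approximated the Jeffreys frequency centroid $\tilde{c}$ by $\tilde{c}'' = \frac{\tilde{a} + \tilde{g}}{2}$, where $\tilde{a}$ and $\tilde{g}$ denote the normalized weighted arithmetic and geometric means, respectively. The normalized geometric weighted mean $\tilde{g} = (\tilde{g}^1, ..., \tilde{g}^d)$ is defined by $\tilde{g}^i = \frac{\prod_{j=1}^n (\tilde{h}_j^i)^{\pi_j}}{\sum_{l=1}^d \prod_{j=1}^n (\tilde{h}_j^l)^{\pi_j}}$, $i \in \{1, ..., d\}$. Since $\sum_{i=1}^d \sum_{j=1}^n \pi_j \tilde{h}_j^i = \sum_{j=1}^n \pi_j \sum_{i=1}^d \tilde{h}_j^i = \sum_{j=1}^n \pi_j = 1$, the normalized arithmetic weighted mean has coordinates $\tilde{a}^i = \sum_{j=1}^n \pi_j \tilde{h}_j^i$.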
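The normalized means and the half arithmetic-geometric approximation $\tilde{c}''$ described above can be computed as in the following Python sketch. The function names and the two example histograms are illustrative assumptions; only the formulas come from the text.

```python
import math

def normalized_means(freq_histograms, weights):
    """Return (a_tilde, g_tilde): normalized weighted arithmetic and geometric means."""
    d = len(freq_histograms[0])
    # For frequency histograms the weighted arithmetic mean already sums to one.
    a = [sum(w * h[i] for w, h in zip(weights, freq_histograms)) for i in range(d)]
    # The weighted geometric mean generally sums to less than one: renormalize it.
    g = [math.exp(sum(w * math.log(h[i]) for w, h in zip(weights, freq_histograms)))
         for i in range(d)]
    g_total = sum(g)
    return a, [v / g_total for v in g]

def half_arithmetic_geometric(freq_histograms, weights):
    """Approximation c'' = (a_tilde + g_tilde) / 2 of the Jeffreys frequency centroid."""
    a_tilde, g_tilde = normalized_means(freq_histograms, weights)
    return [(x + y) / 2.0 for x, y in zip(a_tilde, g_tilde)]

# Two illustrative frequency histograms with equal weights.
hs = [[0.2, 0.8], [0.6, 0.4]]
ws = [0.5, 0.5]
a_tilde, g_tilde = normalized_means(hs, ws)
c2 = half_arithmetic_geometric(hs, ws)
assert math.isclose(sum(a_tilde), 1.0) and math.isclose(sum(g_tilde), 1.0)
assert math.isclose(sum(c2), 1.0)
```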

We consider approximating the Jeffreys frequency centroid by normalizing the Jeffreys positive centroid $c$: $\tilde{c}' = \frac{c}{w_c}$.

We start with a simple lemma:

Lemma 1 The cumulative sum $w_c$ of the bin values of the Jeffreys positive centroid $c$ of a set of frequency histograms is less than or equal to one: $0 < w_c \leq 1$.

Proof Consider the frequency histograms $\tilde{\mathcal{H}}$ as positive histograms. It follows from Theorem 1 that the Jeffreys positive centroid $c$ is such that $w_c = \sum_{i=1}^d c^i = \sum_{i=1}^d \frac{a^i}{W\left(\frac{a^i}{g^i} e\right)}$. Now, the arithmetic-geometric mean inequality states that $a^i \geq g^i$, where $a^i$ and $g^i$ denote the coordinates of the arithmetic and geometric positive means. Since $W$ is increasing on the positive reals and $W(e) = 1$, we therefore have $W\left(\frac{a^i}{g^i} e\right) \geq 1$ and $c^i \leq a^i$. Thus $w_c = \sum_{i=1}^d c^i \leq \sum_{i=1}^d a^i = 1$.

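We consider approximating the Jeffreys frequency centroid on the probability simplex by using the normalization of the Jeffreys positive centroid: $(\tilde{c}')^i = \frac{a^i}{w \, W\left(\frac{a^i}{g^i} e\right)}$, with $w = \sum_{i=1}^d \frac{a^i}{W\left(\frac{a^i}{g^i} e\right)}$.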

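To study the quality of this approximation, we use the following lemma:

Lemma 2 For any histogram $x$ and frequency histogram $\tilde{h}$, we have $J(x, \tilde{h}) = J(\tilde{x}, \tilde{h}) + (w_x - 1)(\mathrm{KL}(\tilde{x} : \tilde{h}) + \log w_x)$, where $w_x$ denotes the normalization factor ($w_x = \sum_{i=1}^d x^i$).

Proof It follows from the definition of the Jeffreys divergence and the fact that $x^i = w_x \tilde{x}^i$ that $J(x, \tilde{h}) = \sum_{i=1}^d (w_x \tilde{x}^i - \tilde{h}^i) \log\frac{w_x \tilde{x}^i}{\tilde{h}^i}$. Expanding and rewriting the right-hand side yields

$$J(x, \tilde{h}) = \sum_{i=1}^d \left( w_x \tilde{x}^i \log\frac{\tilde{x}^i}{\tilde{h}^i} + w_x \tilde{x}^i \log w_x + \tilde{h}^i \log\frac{\tilde{h}^i}{\tilde{x}^i} - \tilde{h}^i \log w_x \right) = (w_x - 1) \log w_x + J(\tilde{x}, \tilde{h}) + (w_x - 1) \sum_{i=1}^d \tilde{x}^i \log\frac{\tilde{x}^i}{\tilde{h}^i} = J(\tilde{x}, \tilde{h}) + (w_x - 1)(\mathrm{KL}(\tilde{x} : \tilde{h}) + \log w_x),$$

since $\sum_{i=1}^d \tilde{h}^i = \sum_{i=1}^d \tilde{x}^i = 1$.

The lemma can be extended to a set of weighted frequency histograms $\tilde{\mathcal{H}}$:

$$J(x, \tilde{\mathcal{H}}) = J(\tilde{x}, \tilde{\mathcal{H}}) + (w_x - 1)(\mathrm{KL}(\tilde{x} : \tilde{\mathcal{H}}) + \log w_x),$$

where $J(x, \tilde{\mathcal{H}}) = \sum_{j=1}^n \pi_j J(x, \tilde{h}_j)$ and $\mathrm{KL}(\tilde{x} : \tilde{\mathcal{H}}) = \sum_{j=1}^n \pi_j \mathrm{KL}(\tilde{x} : \tilde{h}_j)$ (with $\sum_{j=1}^n \pi_j = 1$).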
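The identity of Lemma 2 is easy to sanity-check numerically. The Python sketch below does so for one arbitrary positive histogram $x$ and one frequency histogram $\tilde{h}$; the values and function names are made up for illustration.

```python
import math

def kl(p, q):
    """KL divergence between two frequency histograms (both summing to one)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def jeffreys(p, q):
    """Jeffreys divergence, valid for arbitrary positive histograms."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

x = [0.3, 0.9, 0.6]                 # positive histogram (illustrative)
h = [0.5, 0.2, 0.3]                 # frequency histogram (sums to one)
w_x = sum(x)                        # normalization factor of x
x_tilde = [v / w_x for v in x]      # normalized version of x

lhs = jeffreys(x, h)
rhs = jeffreys(x_tilde, h) + (w_x - 1.0) * (kl(x_tilde, h) + math.log(w_x))
assert math.isclose(lhs, rhs)
```

The weighted extension follows by applying the same check to each $\tilde{h}_j$ and averaging with the weights $\pi_j$.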
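We state the second theorem concerning our guaranteed approximation:

Theorem 2 Let $\tilde{c}$ denote the Jeffreys frequency centroid and $\tilde{c}' = \frac{c}{w_c}$ the normalized Jeffreys positive centroid. Then the approximation factor $\alpha_{\tilde{c}'} = \frac{J(\tilde{c}', \tilde{\mathcal{H}})}{J(\tilde{c}, \tilde{\mathcal{H}})}$ is such that $1 \leq \alpha_{\tilde{c}'} \leq 1 + \left(\frac{1}{w_c} - 1\right) \frac{\mathrm{KL}(c, \tilde{\mathcal{H}})}{J(c, \tilde{\mathcal{H}})} \leq \frac{1}{w_c}$ (with $w_c \leq 1$).

Proof We have $J(c, \tilde{\mathcal{H}}) \leq J(\tilde{c}, \tilde{\mathcal{H}}) \leq J(\tilde{c}', \tilde{\mathcal{H}})$. Using Lemma 2, since $J(\tilde{c}', \tilde{\mathcal{H}}) = J(c, \tilde{\mathcal{H}}) + (1 - w_c)(\mathrm{KL}(\tilde{c}', \tilde{\mathcal{H}}) + \log w_c)$ and $J(c, \tilde{\mathcal{H}}) \leq J(\tilde{c}, \tilde{\mathcal{H}})$, it follows that $1 \leq \alpha_{\tilde{c}'} \leq 1 + \frac{(1 - w_c)(\mathrm{KL}(\tilde{c}', \tilde{\mathcal{H}}) + \log w_c)}{J(\tilde{c}, \tilde{\mathcal{H}})}$. We also have $\mathrm{KL}(\tilde{c}', \tilde{\mathcal{H}}) = \frac{1}{w_c} \mathrm{KL}(c, \tilde{\mathcal{H}}) - \log w_c$ (by expanding the KL expression and using the fact that $w_c = \sum_i c^i$). Therefore $\alpha_{\tilde{c}'} \leq 1 + \frac{(1 - w_c) \mathrm{KL}(c, \tilde{\mathcal{H}})}{w_c J(\tilde{c}, \tilde{\mathcal{H}})}$. Since $J(\tilde{c}, \tilde{\mathcal{H}}) \geq J(c, \tilde{\mathcal{H}})$ and $\mathrm{KL}(c, \tilde{\mathcal{H}}) \leq J(c, \tilde{\mathcal{H}})$, we finally obtain $\alpha_{\tilde{c}'} \leq \frac{1}{w_c}$.

When $w_c = 1$ the bound is tight. Experimental results described in the next section show that this normalized Jeffreys positive centroid $\tilde{c}'$ almost coincides with the Jeffreys frequency centroid.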



3.2 Arbitrarily fine approximation by bisection search

It has been shown in [11], [12] that minimizing Eq. 2 over the probability simplex $\Delta_d$ amounts to minimizing the following equivalent problem:

$$\tilde{c} = \arg\min_{\tilde{x} \in \Delta_d} \mathrm{KL}(\tilde{a} : \tilde{x}) + \mathrm{KL}(\tilde{x} : \tilde{g}). \quad (4)$$

Nevertheless, instead of using the two nested loops of Veldhuis' Newton scheme, we can design a single-loop optimization algorithm. We consider the Lagrangian function obtained by enforcing the normalization constraint $\sum_i \tilde{c}^i = 1$, similar to [11]. For each coordinate, setting the derivative with respect to $\tilde{c}^i$ to zero, we get $\log\frac{\tilde{c}^i}{\tilde{g}^i} + 1 - \frac{\tilde{a}^i}{\tilde{c}^i} + \lambda = 0$, which solves as $\tilde{c}^i = \frac{\tilde{a}^i}{W\left(\frac{\tilde{a}^i e^{\lambda + 1}}{\tilde{g}^i}\right)}$. By multiplying these $d$ constraints with $\tilde{c}^i$ respectively and summing up, we deduce that $\lambda = -\mathrm{KL}(\tilde{c} : \tilde{g}) \leq 0$ (also noticed by [11]). From the constraints that all $\tilde{c}^i$'s should be less than one, we bound $\lambda$ as follows: $\tilde{c}^i = \frac{\tilde{a}^i}{W\left(\frac{\tilde{a}^i e^{\lambda + 1}}{\tilde{g}^i}\right)} \leq 1$, which solves for equality when $\lambda = \log(e^{\tilde{a}^i} \tilde{g}^i) - 1$. Thus we seek $\lambda \in [\max_i \log(e^{\tilde{a}^i} \tilde{g}^i) - 1, 0]$. Since $s = \sum_i \tilde{c}^i = 1$, we have the following cumulative sum equation depending on the unknown parameter $\lambda$: $s(\lambda) = \sum_i \tilde{c}^i(\lambda) = \sum_{i=1}^d \frac{\tilde{a}^i}{W\left(\frac{\tilde{a}^i e^{\lambda + 1}}{\tilde{g}^i}\right)}$. This is a monotonically decreasing function with $s(0) \leq 1$. We can thus perform a simple bisection search to approximate the optimal value of $\lambda$, and therefore deduce an arbitrarily fine approximation of the Jeffreys frequency centroid.
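The bisection on $\lambda$ described above might be sketched in Python as follows. The helper `lambert_w0`, the tolerance on the bracket width, and the example histograms are assumptions made for illustration; the search interval $[\max_i \log(e^{\tilde{a}^i} \tilde{g}^i) - 1, 0]$ and the decreasing function $s(\lambda)$ are those of the text.

```python
import math

def lambert_w0(z):
    """Principal branch W_0(z) for z >= 0, by Halley iterations on w * e^w = z."""
    w = math.log1p(z)
    for _ in range(50):
        ew = math.exp(w)
        f = w * ew - z
        step = f / (ew * (w + 1.0) - (w + 2.0) * f / (2.0 * w + 2.0))
        w -= step
        if abs(step) <= 1e-15 * (1.0 + abs(w)):
            break
    return w

def jeffreys_frequency_centroid(freq_histograms, weights, tol=1e-13):
    """Approximate the Jeffreys frequency centroid by bisection on lambda."""
    d = len(freq_histograms[0])
    a = [sum(w * h[i] for w, h in zip(weights, freq_histograms)) for i in range(d)]
    g = [math.exp(sum(w * math.log(h[i]) for w, h in zip(weights, freq_histograms)))
         for i in range(d)]
    g_total = sum(g)
    g = [v / g_total for v in g]                    # normalized geometric mean

    def centroid(lam):
        return [a[i] / lambert_w0(a[i] * math.exp(lam + 1.0) / g[i]) for i in range(d)]

    lo = max(a[i] + math.log(g[i]) for i in range(d)) - 1.0   # here s(lo) >= 1
    hi = 0.0                                                  # here s(0) <= 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(centroid(mid)) > 1.0:
            lo = mid      # s is decreasing: the sum is too large, increase lambda
        else:
            hi = mid
    return centroid(0.5 * (lo + hi))

# Illustrative input: two frequency histograms with equal weights.
c_tilde = jeffreys_frequency_centroid([[0.2, 0.8], [0.6, 0.4]], [0.5, 0.5])
assert abs(sum(c_tilde) - 1.0) < 1e-9
```

Each bisection step halves the bracket on $\lambda$, so the number of evaluations of $s(\lambda)$ grows only logarithmically with the requested precision.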
| | $\alpha_c$ (opt. positive) | $\alpha_{\tilde{c}'}$ (n'lized approx.) | $w_c \leq 1$ (n'lizing coeff.) | $\alpha_{\tilde{c}''}$ (Veldhuis) |
|---|---|---|---|---|
| avg | 0.9648680345638155 | 1.0002205080964255 | 0.9338228644308926 | 1.065590178484613 |
| min | 0.906414219584823 | 1.0000005079528809 | 0.8342819488534723 | 1.0027707382095195 |
| max | 0.9956399220678585 | 1.0000031489541772 | 0.9931975105809021 | 1.3582296675397754 |

Table 1: Experimental performance ratio and statistics for the 30000+ images of the Caltech-256 database. Observe that $\alpha_c = \frac{J(c, \tilde{\mathcal{H}})}{J(\tilde{c}, \tilde{\mathcal{H}})} \leq 1$ since the positive Jeffreys centroid (available in closed-form) minimizes the average Jeffreys divergence criterion. Our guaranteed normalized approximation $\tilde{c}'$ is almost optimal. Veldhuis' simple half normalized arithmetic-geometric approximation performs on average with a 6.56% error but can be far from the optimal in the worst-case (35.8%).

4 Experimental results and discussion 

We used a multi-precision floating point package (http://www.apfloat.org/) to handle calculations and control arbitrary precisions. We chose the Caltech-256 database [16], consisting of 30607 images labeled into 256 categories, to perform experiments: We consider the set of intensity³ histograms $\tilde{\mathcal{H}}$. For each of the 256 categories, we consider the set of histograms falling inside this category and compute the exact Jeffreys positive centroid $c$, its normalized Jeffreys approximation $\tilde{c}'$ and the optimal frequency centroid $\tilde{c}$. We also consider the average of the arithmetic and geometric normalized means $\tilde{c}'' = \frac{\tilde{a} + \tilde{g}}{2}$. We evaluate the average, minimum and maximum ratio $\alpha_x = \frac{J(x, \tilde{\mathcal{H}})}{J(\tilde{c}, \tilde{\mathcal{H}})}$ for $x \in \{c, \tilde{c}', \tilde{c}''\}$. The results are reported in Table 1. Furthermore, to study the best/worst/average performance of the normalized Jeffreys positive centroid $\tilde{c}'$, we ran $10^6$ trials as follows: We draw two random binary histograms ($d = 2$), calculate a fine precision approximation of $\tilde{c}$ using numerical optimization, and calculate the approximation obtained by using the normalized closed-form centroid $\tilde{c}'$. We gather statistics on the ratio $\alpha = \frac{J(\tilde{c}', \tilde{\mathcal{H}})}{J(\tilde{c}, \tilde{\mathcal{H}})} \geq 1$. We find experimentally the following performance: $\bar{\alpha} \approx 1.0000009$, $\alpha_{\max} \approx 1.00181506$, $\alpha_{\min} = 1.000000$. Although $\tilde{c}'$ is almost matching $\tilde{c}$ in those two real-world and synthetic experiments, it remains open to express analytically and exactly its worst-case performance.
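A scaled-down version of the synthetic experiment can be sketched in Python as follows. Several choices here are assumptions and not the protocol used above: equal weights for the two histograms, bins drawn uniformly in $[0.01, 0.99]$, 1000 trials with a fixed seed, double precision instead of multi-precision arithmetic, and a bisection on the derivative of the one-dimensional cost as the reference optimizer. The printed statistics are therefore not expected to reproduce the figures reported above.

```python
import math
import random

def lambert_w0(z):
    """Principal branch W_0(z) for z >= 0, by Halley iterations on w * e^w = z."""
    w = math.log1p(z)
    for _ in range(50):
        ew = math.exp(w)
        f = w * ew - z
        step = f / (ew * (w + 1.0) - (w + 2.0) * f / (2.0 * w + 2.0))
        w -= step
        if abs(step) <= 1e-15 * (1.0 + abs(w)):
            break
    return w

def jeffreys(p, q):
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

def one_trial(rng):
    """Return (alpha, w_c) for two random binary frequency histograms."""
    u, v = rng.uniform(0.01, 0.99), rng.uniform(0.01, 0.99)
    hs = [[u, 1.0 - u], [v, 1.0 - v]]
    a = [0.5 * (hs[0][i] + hs[1][i]) for i in range(2)]       # arithmetic means
    g = [math.sqrt(hs[0][i] * hs[1][i]) for i in range(2)]    # geometric means

    def cost(x):
        return 0.5 * (jeffreys(hs[0], x) + jeffreys(hs[1], x))

    # Normalized closed-form positive centroid (Theorem 1, then division by w_c).
    c = [a[i] / lambert_w0(a[i] * math.e / g[i]) for i in range(2)]
    w_c = sum(c)
    c_prime = [ci / w_c for ci in c]

    # Reference optimum on the simplex x = (t, 1 - t): the cost is convex in t,
    # so we bisect on the sign of its derivative with respect to t.
    def slope(t):
        return ((math.log(t / g[0]) - a[0] / t)
                - (math.log((1.0 - t) / g[1]) - a[1] / (1.0 - t)))

    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if slope(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return cost(c_prime) / cost([t, 1.0 - t]), w_c

rng = random.Random(0)
results = [one_trial(rng) for _ in range(1000)]
alphas = [alpha for alpha, _ in results]
print("avg:", sum(alphas) / len(alphas), "min:", min(alphas), "max:", max(alphas))
```

Each trial also returns $w_c$, so the bound $\alpha \leq \frac{1}{w_c}$ of Theorem 2 can be checked on every sample.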

5 Conclusion 

We summarize the two main contributions of this paper: (1) we proved that the Jeffreys positive centroid admits a closed-form formula expressed using the Lambert $W$ function, and (2) we proved that normalizing this Jeffreys positive centroid yields a tight guaranteed approximation to the Jeffreys frequency centroid. We noticed experimentally that the closed-form normalized Jeffreys positive centroid almost coincides with the Jeffreys frequency centroid, and can therefore be used in Jeffreys $k$-means clustering. Notice that the $k$-means assignment/relocation algorithm still converges monotonically if, instead of computing the exact cluster centroids, we update them with provably better centroids (i.e., by applying one bisection iteration of the Jeffreys frequency centroid computation). We thus end up with a converging variational Jeffreys frequency $k$-means that requires implementing a stopping criterion. The Jeffreys divergence is not the only way to symmetrize the Kullback-Leibler divergence. Other KL symmetrizations include the Jensen-Shannon divergence [5], the Chernoff divergence [5], and a smooth family of symmetric divergences including the Jensen-Shannon and Jeffreys divergences [17].

³ Converting RGB color pixels to 0.3R + 0.596G + 0.11B grey pixels.